fix(peerclient): resolve base path per-call from live FrameTable #2585

Merged
dobrac merged 6 commits into main from lev-fix-switch-path on May 7, 2026

@levb levb commented May 7, 2026

After a peer→storage transition, peerSeekable kept opening base against
the original (uncompressed) path captured at construction. With the
recent compression work, post-transition reads now target a compressed
object, so the cached base resolved to a non-existent path.

P2P routing produced the stale binding (it captured a path while the
build was still uncompressed); it also contains the fix. peerSeekable
now holds (buildID, basic name, base provider, objType) and composes
the actual storage path at base-open time using the CompressionType from
the live FrameTable. The base seekable is reopened when the
CompressionType (ct) changes. No changes outside peerclient: no cache
key, no chunker, no public storage interface.
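
A minimal sketch of the per-call path composition described above, not the merged code. The names `CompressionType`, `storagePath`, and `lazyBase` are hypothetical, and the `.zstd` suffix rule is an assumption based on the `memfile.zstd` object name mentioned later in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// CompressionType is a hypothetical stand-in for the value read from the live FrameTable.
type CompressionType int

const (
	CompressionNone CompressionType = iota
	CompressionZstd
)

// suffix returns the assumed object-name suffix for a compression type.
func (ct CompressionType) suffix() string {
	if ct == CompressionZstd {
		return ".zstd"
	}
	return ""
}

// storagePath composes the object path from stable parts plus the current compression type.
func storagePath(buildID, name string, ct CompressionType) string {
	return buildID + "/" + name + ct.suffix()
}

// lazyBase remembers which path it opened and reopens when the compression type changes.
type lazyBase struct {
	mu      sync.Mutex
	buildID string
	name    string
	opened  bool
	ct      CompressionType
	path    string
	opens   int // counts (re)opens, for illustration only
}

// get returns the base path for ct, reopening only if ct differs from the cached one.
func (b *lazyBase) get(ct CompressionType) string {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.opened || b.ct != ct {
		b.path = storagePath(b.buildID, b.name, ct)
		b.ct = ct
		b.opened = true
		b.opens++
	}
	return b.path
}

func main() {
	b := &lazyBase{buildID: "build-123", name: "memfile"}
	fmt.Println(b.get(CompressionNone))          // path while the build is uncompressed
	fmt.Println(b.get(CompressionZstd))          // path after the transition
	fmt.Println(b.get(CompressionZstd), b.opens) // same ct again: no extra reopen
}
```

Nothing path-shaped is captured at construction here, so a stale binding cannot outlive a compression change.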

Also hoist the transition emit to the top of OpenRangeReader: returning
PeerTransitionedError no longer wastes a base open before the caller
swaps the header and retries.
@cla-bot cla-bot Bot added the cla-signed label May 7, 2026

codecov Bot commented May 7, 2026

❌ 7 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|-----------------|--------|--------|---------|
| 2594            | 7      | 2587   | 7       |
View the full list of 17 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestSandboxMetrics

Flake rate in main: 55.77% (Passed 46 times, Failed 58 times)

Stack Traces | 5.15s run time
=== RUN   TestSandboxMetrics
=== PAUSE TestSandboxMetrics
=== CONT  TestSandboxMetrics
    sandbox_metrics_test.go:45: 
        	Error Trace:	.../api/metrics/sandbox_metrics_test.go:45
        	Error:      	Should NOT be empty, but was 0
        	Test:       	TestSandboxMetrics
--- FAIL: TestSandboxMetrics (5.15s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 68.83% (Passed 48 times, Failed 106 times)

Stack Traces | 2.94s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (2.94s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 72.77% (Passed 52 times, Failed 139 times)

Stack Traces | 206s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (205.88s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 72.97% (Passed 50 times, Failed 135 times)

Stack Traces | 5.78s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ibc0urf9zigw0j8ug2x24
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1347}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ibc0urf9zigw0j8ug2x24
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1348}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1349}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 07 May 2026 23:27:28 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ibc0urf9zigw0j8ug2x24
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (5.78s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 52.80% (Passed 76 times, Failed 85 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 57.41% (Passed 46 times, Failed 62 times)

Stack Traces | 8.97s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
Executing command python in sandbox i6ce42qyx2dqe1dhm0u0n
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1251}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
Executing command python in sandbox iwbng3lrtzebcdmn0t046
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (8.97s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 52.53% (Passed 47 times, Failed 52 times)

Stack Traces | 7.51s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
Executing command python in sandbox ied15e9t929k9c5t91a3l
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1269}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (7.51s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::

Flake rate in main: 50.00% (Passed 47 times, Failed 47 times)

Stack Traces | 6.69s run time
=== RUN   TestBindLocalhost/bind_::
=== PAUSE TestBindLocalhost/bind_::
=== CONT  TestBindLocalhost/bind_::
Executing command python in sandbox ilyrh747eo5pif12ii8js
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1269}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::
        	Messages:   	Unexpected status code 502 for bind address ::
--- FAIL: TestBindLocalhost/bind_:: (6.69s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 58.93% (Passed 46 times, Failed 66 times)

Stack Traces | 7.4s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox igfto186g7ncgreqz91n9
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1263}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (7.40s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 58.93% (Passed 46 times, Failed 66 times)

Stack Traces | 7.2s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox iliylyr5gdw49ff9e0g56
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1263}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (7.20s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir

Flake rate in main: 46.60% (Passed 55 times, Failed 48 times)

Stack Traces | 1.22s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
--- FAIL: TestListDir (1.22s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory

Flake rate in main: 51.06% (Passed 46 times, Failed 48 times)

Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory

Flake rate in main: 51.06% (Passed 46 times, Failed 48 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)

Flake rate in main: 51.06% (Passed 46 times, Failed 48 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files

Flake rate in main: 51.06% (Passed 46 times, Failed 48 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 60.84% (Passed 56 times, Failed 87 times)

Stack Traces | 84.5s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (84.47s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 63.78% (Passed 46 times, Failed 81 times)

Stack Traces | 32.3s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1270}}
Executing command bash in sandbox ipyc1sn420hudw7rluqvc (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 186 MB\nFree memory before tmpfs mount: 798 MB\nMemory to use in integrity test (80% of free, min 64MB): 638 MB\n"}}
Executing command bash in sandbox ipyc1sn420hudw7rluqvc (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"638+0 records in\n638+0 records out\n668991488 bytes (669 MB, 638 MiB) copied, 3.29215 s, 203 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=638\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.27\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.29\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2600\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 341\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 11\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox ikmqs4f4u2x1p3t5h4c9j
Executing command bash in sandbox ikmqs4f4u2x1p3t5h4c9j (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1286}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"8240cd1f3d64d8308a2dfe22b0d59778d88e235af999cff020a245cef8e9301b\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox ikmqs4f4u2x1p3t5h4c9j
Executing command bash in sandbox ikmqs4f4u2x1p3t5h4c9j (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1289}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox ikmqs4f4u2x1p3t5h4c9j: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (32.27s)

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The removal of nil checks for the uploaded atomic pointer in OpenRangeReader and tryPeer introduces potential nil pointer dereferences. A transition to storage occurring during a peer read attempt may be missed, causing the base storage to be opened with a stale compression type; re-checking the transition state after the peer attempt is necessary to ensure the caller refreshes the compression state correctly.

Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/seekable.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/seekable.go
Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/storage.go
@levb levb marked this pull request as ready for review May 7, 2026 16:17

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f1a2fab469

Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/seekable.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/storage.go Outdated
…able boundary

- StoreFile now returns an explicit error: peerSeekable only exists when routing
  picked an active peer at open time, so being asked to upload that build is a
  contradiction. Dead in production today (writes use bare persistence), but
  prevents a silent wrong-path write if a future caller routes one through.
- Restore StripCompression in peerStorageProvider.OpenSeekable. Redis peer-key
  TTL outlives upload finalization by ~2 min; a fresh orchestrator can resolve
  a stale entry for a finalized V4/Zstd build, so StorageDiff hands us
  "buildID/memfile.zstd". Without stripping, getBase double-suffixes to
  "memfile.zstd.zstd" on fallthrough.
Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/storage.go Outdated
@dobrac dobrac enabled auto-merge (squash) May 7, 2026 23:20
@dobrac dobrac merged commit 6d217f9 into main May 7, 2026
90 of 92 checks passed
@dobrac dobrac deleted the lev-fix-switch-path branch May 7, 2026 23:36